Biostat 200B Homework 6

Due Feb 28 @ 11:59PM

Author

Ziheng Zhang_606300061

Question 1

Answer:

Question 2

For the SENIC data set, consider the model regressing loglength on xray, census and age.

  1. Plot studentized residual, leverage and Cooks D by id and identify observations that have unusual leverage, residual or Cook’s D. You can make three separate plots or produce a “bubble” plot.
  2. Identify the two hospitals with the highest Cook’s D. By examining influence statistics such as leverage and residual, and log(length) and predictor values, can you discover what are the characteristics of these hospitals that make them potentially influential (that is, give them a high Cook’s D)?
  3. Rather than rely completely on influence measures, it is better to conduct sensitivity analyses to see whether our inferences really do depend on one or a few unusual observations. Run the model after dropping these observations. (We are doing this to see what happens. I do not recommend that you automatically drop influential observations! Rather, you need to do an investigation.) How do the results compare to the results when fitting the model to the full data set? Compare regression coef estimates, p-values and root MSE.
  4. Conduct model diagnostics for the partial relationships in the model using component-plus-residual plots. Decide whether any of the predictors should be transformed to improve the model. Provide the output for your final model.

Answer:

a.
The SAS codes are as follows:

The leverage plot is as follows:

Observations with IDs 104, 112 have high leverage.

The studentized residual plot is as follows:

Observations with ID 47 have high studentized residual.

The Cook’s D plot is as follows:

Observations with IDs 47, 112 appear to have high influence. They have Cook’s D of 0.23 and 0.18, respectively.

b.
Observation with ID 47 has high influence because it has high studentized residual. And after we check its log(length) value, we find that it is 2.9735 and it is the maximum among all observations. The distribution of log(length) is right-skewed, and this observation is an outlier. This means average length of stay of patients in hospital 47 (in days) is much longer than other hospitals.
This is the picture of the log(length) distribution:

This is the picture of the log(length) distribution with ID = 47:

Observation with ID 112 has high influence because it has high leverage. And after we check its three predictors’ values (xray, census and age), we find that census has extremely high value, 791, and it is the maximum among all observations. The distribution of census is right-skewed, and this observation is an outlier. This means average number of patients in hospital 112 per day during study period is much larger than other hospitals.
This is the picture of the census distribution:

This is the picture of the census distribution with ID = 112:

In brief, the two hospitals with the highest Cook’s D are busy and crowded.

c.
The SAS codes are as follows:

The results with unusual observations are as follows:

After dropping the two unusual observations, the results are as follows:

Coefficient of xray (with unusual observations): \(0.00330\), p-value \(<0.0001\).
Coefficient of xray (without unusual observations): \(0.00283\), p-value \(<0.0001\).
Coefficient of census (with unusual observations): \(0.00054446\), p-value \(<0.0001\).
Coefficient of census (without unusual observations): \(0.00045298\), p-value \(<0.0001\).
Coefficient of age (with unusual observations): \(0.00812\), p-value = \(0.0072\).
Coefficient of age (without unusual observations): \(0.00578\), p-value = \(0.0385\).
Root MSE (with unusual observations): \(0.13975\).
Root MSE (without unusual observations): \(0.12799\).
Therefore, coefficient estimates, p-values and root MSE are almost the same. This means our inferences do not depend on one or a few unusual observations.

d. The SAS codes are as follows:

The parameter estimates are as follows:

The component-plus-residual plot for age is as follows:

The component-plus-residual plot for xray is as follows:

The component-plus-residual plot for census is as follows:

So it seems that age does not need to transform, but xray and census need to transform. It looks like a quadratic polynomial might work for xray and we need to go down ladder in x (census). So we use a log transformation for census and a quadratic polynomial for xray.

The transformation SAS codes are as follows:

The new component-plus-residual plot for xray is as follows:

The new component-plus-residual plot for census is as follows:

These two plots look better than before. So we use the transformed predictors to fit the model and the final model is as follows:

\[ log(length) = 1.00948 + 0.11202 \times log(census) + 0.00953 \times age + 0.00115 \times xray + 0.00001223 \times xray^2 \]

Question 3

For the SENIC data set, use best subsets selection to select a best model. The outcome variable is risk.

  1. The pool of candidate predictors is region, beds, services, medsch, xray, length and a new variable created as nurses/census (nurse/patient ratio).
  2. Before model selection, log-transform positively skewed predictors.
  3. Consider Cp, AIC and SBC/BIC to select among models.
  4. To present your results, make a table listing the top models (about 5-8 models, using your discretion) and the values of their model selection criteria. Briefly describe the results and which model appears to be the “best”.

Answer:

After creating the new variable nurses/census, we check the distribution of all predictors. The distribution of beds, length and nurses/census is right-skewed, and we log-transform it. The SAS codes and results are as follows:

The result tables are as follows:

library(kableExtra)
model_data <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6),
  Models = c("regnc regs regw logbeds svcs msch xray loglength lognuratio", 
             "regnc regw logbeds svcs msch xray loglength lognuratio", 
             "regw logbeds svcs msch xray loglength lognuratio", 
             "regw logbeds msch xray loglength lognuratio", 
             "regw logbeds xray loglength lognuratio", 
             "logbeds xray loglength lognuratio"),
  Cp = c(10.0000, 8.0627, 6.1141, 4.2378, paste0(3.3371, "*"), 5.6439),
  AIC = c(116.3372, 114.4060, 112.4623, 110.5978, paste0(109.7947, "*"), 
          112.3654),
  SBC = c(28.6111, 23.9525, 19.2814, 14.6895, 11.1590, paste0(11.0023, "*"))
)
kable(model_data, "html") |>
  kable_styling(full_width = F)
library(kableExtra)
library(olsrr)
library(readr)
library(tidyverse)
senic_data <- read_csv("/Users/zihengzhang/Downloads/200B/senic.csv")
senic_transformed <- senic_data |>
  mutate(nuratio = nurses/census) |> 
  mutate(logbeds = log(beds), loglength = log(length), 
         lognuratio = log(nuratio)) |>
  mutate(region = factor(region)) |>
  mutate(msch = ifelse(msch == 2, 0, msch)) |>
  mutate(msch = factor(msch)) |>
  select(risk, region, logbeds, svcs, msch, xray, loglength, lognuratio)

senic_model <- lm(risk ~ ., data = senic_transformed)

senic_select <- ols_step_all_possible(senic_model)

senic_table <- senic_select$result |>
  select(predictors, cp, aic, sbc) |>
  arrange(cp) |> 
  slice(1:8) |>
  mutate(id = seq(1, 8, 1)) |>
  select(id, everything()) |>
  kable("html") |>
  kable_styling(full_width = FALSE)

senic_table
id predictors cp aic sbc
1 logbeds xray loglength lognuratio 5.643898 320.0455 336.4098
2 logbeds msch xray loglength lognuratio 6.956702 321.3285 340.4202
3 region logbeds xray loglength lognuratio 7.199676 321.3259 345.8724
4 logbeds svcs xray loglength lognuratio 7.268830 321.6547 340.7464
5 region logbeds msch xray loglength lognuratio 8.085825 322.1114 349.3853
6 logbeds svcs msch xray loglength lognuratio 8.733038 323.0941 344.9132
7 region logbeds svcs xray loglength lognuratio 8.976461 323.0835 350.3574
8 region logbeds svcs msch xray loglength lognuratio 10.000000 324.0173 354.0186
summary(lm(risk ~ logbeds + xray + loglength + lognuratio, 
           data = senic_transformed))

Call:
lm(formula = risk ~ logbeds + xray + loglength + lognuratio, 
    data = senic_transformed)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7345 -0.6425 -0.0777  0.5684  2.8133 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.784931   1.179192  -4.906 3.31e-06 ***
logbeds      0.509710   0.142622   3.574 0.000527 ***
xray         0.016694   0.005339   3.127 0.002270 ** 
loglength    2.757251   0.649407   4.246 4.63e-05 ***
lognuratio   1.040605   0.276438   3.764 0.000272 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9673 on 108 degrees of freedom
Multiple R-squared:  0.4982,    Adjusted R-squared:  0.4796 
F-statistic: 26.81 on 4 and 108 DF,  p-value: 1.874e-15
Here we use backward stepwise selection to select the best model. The best model is the one with the smallest Cp, AIC, and SBC. In the table, the smallest values are marked with a star. Usually, it should be the same model. But in this case, the smallest Cp and AIC is the fifth model, while the smallest SBC is the sixth model. If we check other metrics, such as BIC and PRESS, we can find that the fifth model is the best model. The best model is as follows:

\[\begin{align*} risk &= -6.784094 + 0.613906 \times regw + 0.509132 \times log(beds) + 0.016971 \times xray + \\ &\quad 3.145076 \times log(length) + 0.853615 \times log(nurses/census) \end{align*}\]

\[\begin{align*} risk &= -5.784931 + 0.509710 \times log(beds) + 0.016694 \times xray + 2.757251 \times log(length) + \\ &\quad 1.040605 \times log(nurses/census) \end{align*}\]